Finite-State Technology in Natural Language Processing
Abstract
Finite-state technology is at the core of many standard approaches in natural language processing [1, 2]. However, terminology and notation differ significantly between theoretical computer science (TCS) [3] and natural language processing (NLP) [4]. In this lecture, inspired by [2, 4], we illustrate the close ties between formal language theory as discussed in TCS and its use in mainstream NLP applications. In addition, we try to match the different terminologies in three example tasks. Overall, this lecture serves as an introduction to (i) these tasks and (ii) the use of finite-state technology in NLP, and it aims to encourage closer collaboration between TCS and NLP.
We will start with the task of part-of-speech tagging [2, Chapter 5]: given a natural language sentence, derive the word category (the part of speech, e.g., noun, verb, or adjective) for each word occurring in the sentence. Part-of-speech information is essential for several downstream applications, such as co-reference resolution [2, Chapter 21] (i.e., detecting which expressions in a text refer to the same entity), automatic keyword detection [2, Chapter 22] (i.e., finding relevant terms for a document), and sentiment analysis [5] (i.e., determining whether a text speaks favorably or negatively about a subject). Following the historical development of systems for this task [6], we will discuss the main performance breakthrough (in the mid-1980s) that led to the current state-of-the-art systems. This breakthrough was achieved with the help of statistical finite-state systems commonly called hidden Markov models [2, Chapter 6], which roughly equate to probabilistic finite-state transducers [7]. We will outline this connection and also demonstrate how well-known algorithms such as the forward and backward algorithms relate to TCS concepts.
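To make the relation between the forward algorithm and the automata view concrete, the following is a minimal sketch and not the lecture's own formulation. It treats a toy hidden Markov model as a weighted automaton whose paths are tag sequences, and it checks that the forward recursion yields the same value as summing the weights of all paths explicitly. The two tags, the two-word vocabulary, all probabilities, and the names `forward` and `brute_force` are assumptions made up for illustration.

```python
from itertools import product

# Toy HMM for part-of-speech tagging.
# All states, words, and probabilities are invented for illustration.
STATES = ["NOUN", "VERB"]
START = {"NOUN": 0.7, "VERB": 0.3}
TRANS = {
    "NOUN": {"NOUN": 0.4, "VERB": 0.6},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
EMIT = {
    "NOUN": {"fish": 0.6, "swim": 0.4},
    "VERB": {"fish": 0.3, "swim": 0.7},
}


def forward(words):
    """Probability of the word sequence, summed over all tag sequences.

    alpha[s] holds the total weight of all paths that end in state s
    after reading the words seen so far.
    """
    if not words:
        return 1.0
    alpha = {s: START[s] * EMIT[s].get(words[0], 0.0) for s in STATES}
    for word in words[1:]:
        alpha = {
            s: sum(alpha[r] * TRANS[r][s] for r in STATES) * EMIT[s].get(word, 0.0)
            for s in STATES
        }
    return sum(alpha.values())


def brute_force(words):
    """Same quantity, computed by enumerating every path (tag sequence)."""
    if not words:
        return 1.0
    total = 0.0
    for tags in product(STATES, repeat=len(words)):
        weight = START[tags[0]] * EMIT[tags[0]].get(words[0], 0.0)
        for prev, cur, word in zip(tags, tags[1:], words[1:]):
            weight *= TRANS[prev][cur] * EMIT[cur].get(word, 0.0)
        total += weight
    return total


if __name__ == "__main__":
    sentence = ["fish", "swim"]
    print(forward(sentence), brute_force(sentence))
```

The enumeration visits exponentially many paths, while the forward recursion needs only a number of steps linear in the sentence length and quadratic in the number of states. In automata terms, this is the usual dynamic-programming evaluation of a weighted automaton on an input string.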
Second, we will discuss the task of parsing [2, Chapter 13], in which a sentence is given and its syntactic structure is to be determined. The syntactic structure is beneficial in several applications, including syntax-based machine translation [8] and natural language understanding [2, Chapter 18]. In parsing, a major performance breakthrough was obtained in 2005 by adding finite-state information to probabilistic context-free grammars [9]. The currently state-of-the-art models
Similar resources
Strengths and weaknesses of finite-state technology: a case study in morphological grammar development
Finite-state technology is considered the preferred model for representing the phonology and morphology of natural languages. The attractiveness of this technology for natural language processing stems from four sources: modularity of the design, due to the closure properties of regular languages and relations; the compact representation that is achieved through minimization; efficiency, which ...
Finite-State Technology as a Programming Environment
Finite-state technology is considered the preferred model for representing the phonology and morphology of natural languages. The attractiveness of this technology for natural language processing stems from four sources: modularity of the design, due to the closure properties of regular languages and relations; the compact representation that is achieved through minimization; efficiency, which ...
Survey: Finite-state technology in natural language processing
In this survey, we will discuss current uses of finite-state information in several statistical natural language processing tasks. To this end, we will review standard approaches in tokenization, part-of-speech tagging, and parsing, and illustrate the utility of finite-state information and technology in these areas. The particular problems were chosen to allow a natural progression from simple...
Standard Arabic formalization and linguistic platform for its analysis
From the beginning of the sixties, and starting with the first automatic analyzer proposed by David Cohen, one of the first theorists of NLP [1], research has continued with natural language processing and especially the automatic treatment of the Arabic language. In 1983, with a minimalist morphological analysis, based on the theory that any Arabic form is generated using root and pattern, res...
Message Handler
The Message Handler extracts critical intelligence information from the free text of military messages to merge with the data parsed from fixed fields. In the third of four project phases, the current focus is on developing stand-alone capabilities demonstrating the viability of natural language processing in Advance Concept Technology Demonstrations (ACTDs), and to operational users at militar...
Publication date: 2015